System Design Refresher

Table of Contents

1. Networking & Communication

2. Storage & Databases

3. Scalability & Reliability

4. System Design Patterns

5. Advanced Caching

6. Observability

7. Security & Privacy

8. Infrastructure & Deployment

9. Special Topics

10. Additional Important Topics

Interview Preparation

Quick Reference


1. Networking & Communication

IP Addressing

  • IPv4: 32-bit addresses (e.g., 192.168.1.1), supports ~4.3 billion addresses
  • IPv6: 128-bit addresses, designed to solve IPv4 exhaustion
  • Private vs Public IPs: Private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) are for internal networks; public IPs are internet-facing
  • CIDR Notation: 192.168.1.0/24 means first 24 bits are network, last 8 for hosts
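
To make the /24 arithmetic concrete, here is a small sketch using Python's standard `ipaddress` module. The sample addresses checked at the end (192.168.1.77 and 10.1.2.3) are illustrative and not from the notes above.

```python
import ipaddress

# 192.168.1.0/24: the first 24 bits identify the network, the last 8 bits the host.
network = ipaddress.ip_network("192.168.1.0/24")

total_addresses = network.num_addresses      # 2 ** (32 - 24)
usable_hosts = total_addresses - 2           # minus the network and broadcast addresses
netmask = str(network.netmask)
broadcast = str(network.broadcast_address)

# Membership and private-range checks on illustrative addresses.
in_network = ipaddress.ip_address("192.168.1.77") in network
is_private = ipaddress.ip_address("10.1.2.3").is_private

print(total_addresses, usable_hosts, netmask, broadcast, in_network, is_private)
```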

DNS (Domain Name System)

  • Translates domain names to IP addresses
  • Hierarchical system: Root → TLD (.com) → Authoritative nameserver
  • Record types:
    • A: Maps domain to IPv4
    • AAAA: Maps to IPv6
    • CNAME: Alias to another domain
    • MX: Mail server
    • NS: Nameserver
  • DNS caching: Browsers, OS, recursive resolvers cache results (TTL-based)
  • Interview tip: DNS is often a single point of failure; use multiple nameservers

Load Balancers

L4 (Layer 4 - Transport Layer)

  • Operates at TCP/UDP level
  • Routes based on IP address and port
  • Faster, less inspection overhead
  • Cannot route based on content (URL, headers)
  • Use case: High-throughput, low-latency requirements

L7 (Layer 7 - Application Layer)

  • Operates at HTTP/HTTPS level
  • Routes based on URLs, headers, cookies
  • Can do SSL termination, content-based routing
  • More CPU intensive
  • Use case: Microservices with different endpoints, A/B testing

Proxy

Forward Proxy

  • Client-side proxy
  • Client → Forward Proxy → Internet
  • Use cases:
    • Content filtering in organizations
    • Anonymity (VPN-like behavior)
    • Caching for clients
  • Example: Corporate proxy server

Reverse Proxy

  • Server-side proxy
  • Client → Reverse Proxy → Backend servers
  • Use cases:
    • Load balancing
    • SSL termination
    • Caching
    • Security (hide backend infrastructure)
  • Examples: Nginx, HAProxy

TCP vs UDP

Feature        | TCP                                     | UDP
Connection     | Connection-oriented (3-way handshake)   | Connectionless
Reliability    | Guaranteed delivery, ordered            | No guarantee, may lose/reorder packets
Speed          | Slower (overhead)                       | Faster
Use cases      | HTTP, SSH, File transfers               | Video streaming, DNS, Gaming, VoIP
Flow control   | Yes (prevents overwhelming receiver)    | No
Error checking | Extensive                               | Basic checksum

Interview insight: TCP trades speed for reliability; UDP trades reliability for speed

HTTP/HTTPS Basics

HTTP Methods

  • GET: Retrieve resource (idempotent, cacheable)
  • POST: Create resource (not idempotent)
  • PUT: Update/replace entire resource (idempotent)
  • PATCH: Partial update (not necessarily idempotent)
  • DELETE: Remove resource (idempotent)
  • HEAD: Like GET but without response body
  • OPTIONS: Check available methods

Important Headers

  • Cache-Control: Directives for caching (max-age, no-cache, no-store)
  • ETag: Resource version identifier for conditional requests
  • Authorization: Bearer tokens, API keys
  • Content-Type: MIME type (application/json, text/html)
  • User-Agent: Client information
  • Accept: Content types client can process

HTTP Status Codes

  • 2xx: Success (200 OK, 201 Created, 204 No Content)
  • 3xx: Redirection (301 Moved Permanently, 304 Not Modified)
  • 4xx: Client errors (400 Bad Request, 401 Unauthorized, 404 Not Found, 429 Too Many Requests)
  • 5xx: Server errors (500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable)

REST vs GraphQL vs gRPC

REST

  • Resource-based URLs (/users/123)
  • Standard HTTP methods
  • Over-fetching/under-fetching possible
  • Easy caching via HTTP
  • Best for: Public APIs, CRUD operations

GraphQL

  • Single endpoint (/graphql)
  • Client specifies exact data needed
  • Reduces over-fetching
  • More complex server-side
  • Best for: Complex data requirements, mobile apps (bandwidth concerns)

gRPC

  • Uses Protocol Buffers (binary format)
  • HTTP/2 based, bidirectional streaming
  • Strongly typed contracts
  • More efficient than JSON
  • Best for: Internal microservices, high-performance requirements

Real-Time Communication

WebSockets

  • Full-duplex communication over single TCP connection
  • Persistent connection (after HTTP upgrade)
  • Low latency, real-time bidirectional
  • Use cases: Chat apps, live trading, multiplayer games
  • Trade-off: Stateful, harder to scale (requires sticky sessions)

Long Polling

  • Client requests, server holds connection until data available
  • Then responds, client immediately requests again
  • More overhead than WebSockets but better compatibility
  • Use cases: Real-time updates where WebSockets unavailable

Server-Sent Events (SSE)

  • Server pushes updates to client over HTTP
  • Unidirectional (server → client)
  • Auto-reconnects, built-in event IDs
  • Use cases: News feeds, stock tickers, notifications
  • Trade-off: Only server-to-client (unlike WebSockets)

CDN (Content Delivery Network)

  • Distributed servers at edge locations (geographically closer to users)
  • Benefits:
    • Reduced latency (geographic proximity)
    • Lower bandwidth costs
    • DDoS protection
    • High availability
  • Edge caching: Static assets cached at CDN edges
  • Geo-replication: Content replicated across regions
  • Invalidation: Can purge/update cached content
  • Examples: Cloudflare, Akamai, CloudFront

2. Storage & Databases

Relational Databases (SQL)

ACID Properties

  • Atomicity: All operations in transaction succeed or all fail (no partial states)
  • Consistency: Database remains in valid state (constraints honored)
  • Isolation: Concurrent transactions don't interfere
  • Durability: Committed data persists even after crashes

Isolation Levels

  • Read Uncommitted: Can read uncommitted changes (dirty reads)
  • Read Committed: Only reads committed data (default in many DBs)
  • Repeatable Read: Rows already read return the same values when re-read within the transaction (phantom rows may still appear in some databases)
  • Serializable: Strongest isolation, transactions fully isolated

Normalization

  • 1NF: Atomic values, no repeating groups
  • 2NF: 1NF + no partial dependencies (all non-key attributes depend on entire primary key)
  • 3NF: 2NF + no transitive dependencies
  • Trade-off: More normalized = less redundancy but more joins; denormalization for read performance

Database Replication Patterns

Master-Slave (Primary-Replica)

  • Write: Goes to master only
  • Read: Can be served from slaves/replicas
  • Pros: Scales reads, simple architecture
  • Cons: Master is single point of failure for writes, replication lag
  • Use case: Read-heavy workloads (90% reads, 10% writes)

Master-Master (Multi-Master)

  • Write: Can go to any master
  • Read: From any master
  • Pros: No single point of failure for writes, better write scaling
  • Cons: Complex conflict resolution, potential data inconsistencies
  • Conflict resolution: Last-write-wins, version vectors, custom logic
  • Use case: Globally distributed systems, high write availability

NoSQL Databases

Key-Value Stores

  • Structure: Simple key → value mapping
  • Examples: Redis, DynamoDB, Riak
  • Pros: Extremely fast, simple, horizontally scalable
  • Cons: Limited query capabilities (no joins, no complex queries)
  • Use cases: Session storage, caching, shopping carts

Document Databases

  • Structure: Store JSON-like documents
  • Examples: MongoDB, CouchDB, Firestore
  • Pros: Flexible schema, good for hierarchical data, can index/query fields
  • Cons: Limited or no joins (embed or reference instead), consistency guarantees vary by product and configuration
  • Use cases: Content management, user profiles, catalogs

Wide-Column Stores

  • Structure: Column families, rows can have different columns
  • Examples: Cassandra, HBase, Bigtable
  • Pros: Efficient for sparse data, scales horizontally, fast writes
  • Cons: Complex data modeling, eventual consistency
  • Use cases: Time-series data, IoT sensors, event logging

Graph Databases

  • Structure: Nodes, edges, properties
  • Examples: Neo4j, JanusGraph, Amazon Neptune
  • Pros: Efficient for relationship queries, natural for connected data
  • Cons: Harder to scale horizontally, specialized use cases
  • Use cases: Social networks, recommendation engines, fraud detection

Database Indexes

B-Tree Indexes

  • Structure: Balanced tree, sorted data
  • Pros: Good for range queries, ordered data
  • Operations: O(log n) for search, insert, delete
  • Use case: Default index in most SQL databases
  • Example: WHERE age BETWEEN 25 AND 35

Hash Indexes

  • Structure: Hash table
  • Pros: O(1) for exact match lookups
  • Cons: No range queries, no ordering
  • Use case: Equality comparisons only
  • Example: WHERE user_id = 123

Inverted Indexes

  • Structure: Maps terms to document IDs containing them
  • Used in: Full-text search engines
  • Example:
    • Doc1: "quick brown fox"
    • Doc2: "brown dog"
    • Index: "brown" → [Doc1, Doc2]
  • Use case: Search functionality (Elasticsearch)
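
The Doc1/Doc2 example above can be reproduced with a minimal sketch like the following. The function names `build_inverted_index` and `search` are illustrative, and the whitespace tokenizer is a deliberate simplification (real engines add stemming, stop words, and scoring).

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace tokenization
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return IDs of documents that contain every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


docs = {"Doc1": "quick brown fox", "Doc2": "brown dog"}
index = build_inverted_index(docs)
print(index["brown"])               # documents containing "brown"
print(search(index, "brown fox"))   # documents containing both terms
```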

Geospatial Indexes

  • Types:
    • R-tree: For spatial data (rectangles, polygons)
    • Quadtree: Divides space into quadrants recursively
    • Geohash: Encodes lat/long into string
  • Use cases: "Find restaurants within 5km", location-based services
  • Examples: MongoDB geospatial, PostGIS

Replication & Sharding

Horizontal Scaling (Sharding)

  • Definition: Distribute data across multiple machines
  • Sharding strategies:
    • Hash-based: hash(key) % num_shards
    • Range-based: user_id 1 to 1M on shard1, 1M+1 to 2M on shard2
    • Geography-based: EU users on EU shard, US users on US shard
    • Directory-based: Lookup table maps keys to shards
  • Challenges:
    • Cross-shard joins expensive
    • Rebalancing shards when adding nodes
    • Choosing good shard key (avoid hotspots)
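
As a rough illustration of the hash-based strategy and of the rebalancing challenge, the sketch below routes keys with `hash(key) % num_shards` and counts how many keys change shard when a fifth shard is added. The `shard_for` helper and the `user:N` key format are assumptions made for the example; a stable hash (SHA-256 here) is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard with a stable hash so every process routes a key the same way."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


keys = [f"user:{i}" for i in range(1000)]

# Going from 4 to 5 shards changes the modulus, so most keys land on a different shard.
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved} of {len(keys)} keys would move")
```

With a uniform hash, roughly four out of five keys are expected to move in this scenario, which is the motivation for consistent hashing or directory-based schemes.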

Vertical Scaling

  • Definition: Add more resources (CPU, RAM) to single machine
  • Pros: Simpler (no distributed complexity)
  • Cons: Limited by hardware limits, expensive, single point of failure
  • When to use: Before reaching limits, for databases requiring strong consistency

CAP Theorem

You can only have 2 of 3: Consistency, Availability, Partition tolerance

  • Consistency: All nodes see same data at same time
  • Availability: Every request gets response (success/failure)
  • Partition tolerance: System works despite network partitions

In practice: Network partitions will happen, so choose between:

  • CP (Consistency + Partition tolerance): Sacrifice availability during partition
    • Examples: HBase, MongoDB (strong consistency mode)
    • Use case: Financial systems, inventory
  • AP (Availability + Partition tolerance): Sacrifice consistency during partition
    • Examples: Cassandra, DynamoDB, Riak
    • Use case: Social media feeds, analytics

PACELC Theorem (more realistic):

  • If Partition, choose A or C
  • Else (no partition), choose Latency or Consistency

Query Optimization

Techniques

  1. Indexes: Most critical (but adds write overhead)
  2. Query analysis: Use EXPLAIN to see execution plan
  3. Avoid SELECT *: Fetch only needed columns
  4. Limit result sets: Pagination, WHERE clauses
  5. Denormalization: For read-heavy workloads
  6. Partitioning: Split large tables
  7. Connection pooling: Reuse database connections
  8. Caching: Redis for frequently accessed data

Common Issues

  • N+1 queries: Fetching related data in loop (use joins or batch fetches)
  • Full table scans: Missing indexes on WHERE/JOIN columns
  • Suboptimal joins: Wrong join order or type

CDC (Change Data Capture)

  • Purpose: Track changes in database (inserts, updates, deletes)
  • Methods:
    • Log-based: Read database transaction logs (MySQL binlog, Postgres WAL)
    • Trigger-based: Database triggers on changes
    • Timestamp-based: Check last_modified column
  • Use cases:
    • Data replication to data warehouse
    • Invalidating caches
    • Event-driven architectures
  • Tools: Debezium, Maxwell, AWS DMS

Full-Text Search

  • Problem: SQL LIKE '%keyword%' is slow (a leading wildcard prevents use of an ordinary index)
  • Solution: Specialized search engines with inverted indexes
  • Features:
    • Tokenization (breaking text into terms)
    • Stemming (run, running → run)
    • Relevance scoring (TF-IDF, BM25)
    • Fuzzy matching (typo tolerance)
  • Examples: Elasticsearch, Solr, Algolia
  • Architecture: Separate search cluster, sync from primary DB via CDC

Caching Strategies

Cache Levels

  1. Client-side: Browser cache, mobile app cache
  2. CDN: Edge servers cache static assets
  3. Reverse proxy: Nginx caches responses
  4. Application cache: In-memory (Redis, Memcached)
  5. Database cache: Query result cache, buffer pool

Cache Patterns

Cache-Aside (Lazy Loading)

1. Check cache
2. If miss: fetch from DB, populate cache
3. Return data
  • Pros: Only caches requested data
  • Cons: Cache miss penalty, potential stale data
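
The three steps above can be sketched as follows. This is a minimal in-process version under stated assumptions: the `CacheAside` class, its TTL handling, and the injected `load_from_db` and `clock` parameters are all hypothetical names for illustration, and a plain dict stands in for the database.

```python
import time


class CacheAside:
    """Application-managed cache: the app checks the cache, then falls back to the DB."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db      # callable that fetches a value from the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}               # key -> (value, expires_at)
        self.misses = 0

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]                                     # 1. cache hit
        self.misses += 1
        value = self._load(key)                                 # 2. miss: fetch from DB...
        self._store[key] = (value, self._clock() + self._ttl)   # ...and populate the cache
        return value                                            # 3. return data

    def invalidate(self, key):
        """Drop a key, for example after the application writes to the DB."""
        self._store.pop(key, None)


db = {"user:1": "Ada"}                 # stand-in for a real database
cache = CacheAside(db.__getitem__)
cache.get("user:1")                    # miss, loads from db
cache.get("user:1")                    # hit, served from the cache
```

Note that a value changed in the database stays stale in the cache until the TTL expires or `invalidate` is called, which is the "potential stale data" drawback listed above.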

Read-Through

  • Cache sits between app and DB
  • Cache handles DB fetching automatically
  • Pros: Simpler app code
  • Cons: Cache miss still slow

Write-Through

  • Writes go to cache and DB synchronously
  • Pros: Cache always consistent
  • Cons: Higher write latency

Write-Behind (Write-Back)

  • Writes go to cache, asynchronously written to DB
  • Pros: Low write latency
  • Cons: Risk of data loss, complex

Write-Around

  • Writes go directly to DB, bypass cache
  • Pros: Avoids cache pollution from writes
  • Cons: Cache miss on next read

Cache Eviction Policies

  • LRU (Least Recently Used): Evict the item that was accessed least recently (good general purpose)
  • LFU (Least Frequently Used): Evict the item with the fewest accesses (good for stable access patterns)
  • FIFO (First In First Out): Evict the item that was inserted first, regardless of use (simple but not optimal)
  • TTL (Time To Live): Evict after fixed time (good for time-sensitive data)
  • Random: Evict random item (simple, surprisingly effective)
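
As one concrete instance of these policies, here is a minimal LRU sketch built on `collections.OrderedDict`. The `LRUCache` class and its method names are illustrative; production caches typically add locking and TTLs.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()    # ordered from least to most recently used

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)   # a read counts as a use
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # drop the least recently used entry

    def keys(self):
        return list(self._items)


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" is now the most recently used
cache.put("c", 3)     # evicts "b"
```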

Cache Stampede (Thundering Herd)

  • Problem: Cache expires, multiple requests hit DB simultaneously
  • Solutions:
    • Lock on cache miss (first request fetches, others wait)
    • Probabilistic early expiration
    • Background refresh before expiration

3. Scalability & Reliability

Load Balancing Algorithms

Round Robin

  • Distribute requests sequentially across servers
  • Pros: Simple, fair distribution
  • Cons: Doesn't consider server load/capacity
  • Weighted Round Robin: Assign more requests to powerful servers

Least Connections

  • Send to server with fewest active connections
  • Pros: Better for long-lived connections
  • Cons: Requires tracking connection state

Consistent Hashing

  • Hash both requests and servers onto ring
  • Request goes to next clockwise server
  • Pros: Minimal redistribution when adding/removing servers
  • Cons: Can create hotspots
  • Solution: Virtual nodes (multiple positions per server)
  • Use case: Distributed caches, sharding
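
A minimal sketch of a hash ring with virtual nodes follows. The `HashRing` class, the `server#i` naming of virtual nodes, and the choice of 100 virtual nodes per server are assumptions for illustration rather than a reference implementation.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, servers=(), vnodes=100):
        self._vnodes = vnodes
        self._points = []    # sorted ring positions
        self._owner = {}     # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        # Each server gets several positions (virtual nodes) to even out the load.
        for i in range(self._vnodes):
            point = _hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove(self, server):
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, key):
        """Return the server owning the first position clockwise from the key's hash."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("cache-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to the new server")
```

Only keys that now fall just before one of the new server's positions change owner; every other key stays where it was.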

IP Hash

  • Hash client IP to determine server
  • Pros: Same client always goes to same server (session affinity)
  • Cons: Uneven distribution if IPs not diverse

Least Response Time

  • Send to server with fastest response
  • Pros: Adapts to server performance
  • Cons: Requires health checks, more complex

Rate Limiting & Throttling

Why Rate Limit?

  • Prevent abuse/DoS attacks
  • Fair resource allocation
  • Cost control (API quotas)
  • Ensure quality of service

Algorithms

Token Bucket

  • Bucket holds tokens (refilled at fixed rate)
  • Request consumes token
  • Pros: Handles bursts, smooth rate
  • Cons: More complex
  • Example: AWS API Gateway
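
A single-process token bucket might look like the sketch below. The `TokenBucket` class, the `rate`/`capacity` parameter names, and the injectable `clock` are illustrative assumptions; a distributed limiter would keep this state in a shared store instead.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, then a sustained `rate` of requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate                # tokens added per second
        self.capacity = capacity        # maximum burst size
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill lazily based on the time elapsed since the previous call.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=5, capacity=10)
if not bucket.allow():
    print("429 Too Many Requests")
```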

Leaky Bucket

  • Requests enter bucket, leak out at fixed rate
  • Pros: Smooth output rate
  • Cons: No burst handling
  • Use case: Network traffic shaping

Fixed Window

  • Allow N requests per time window (e.g., 100/hour)
  • Pros: Simple
  • Cons: Bursts at window boundaries (with a 100/hour limit, 100 requests just before the boundary plus 100 just after it lets 200 through within seconds)

Sliding Window Log

  • Track timestamp of each request
  • Count requests in sliding time window
  • Pros: Accurate, no boundary burst
  • Cons: Memory intensive (store all timestamps)

Sliding Window Counter

  • Combines fixed window + weighted previous window
  • Pros: Accurate, memory efficient
  • Cons: Slightly complex
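
One common way to combine the current fixed window with a weighted previous window is sketched below. The `SlidingWindowCounter` class and its attribute names are hypothetical, and the estimate assumes requests in the previous window were spread evenly, so it is an approximation rather than an exact count.

```python
import time


class SlidingWindowCounter:
    """Approximate sliding window: current count plus a weighted share of the previous window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self._clock = clock
        self._index = None     # index of the current fixed window
        self._current = 0      # requests counted in the current window
        self._previous = 0     # requests counted in the window just before it

    def allow(self):
        now = self._clock()
        index = int(now // self.window)
        if index != self._index:
            # Roll over; the old count only matters if the windows are adjacent.
            adjacent = self._index is not None and index == self._index + 1
            self._previous = self._current if adjacent else 0
            self._current = 0
            self._index = index
        # Share of the previous window still covered by the sliding window.
        elapsed_fraction = (now % self.window) / self.window
        estimated = self._previous * (1.0 - elapsed_fraction) + self._current
        if estimated + 1 > self.limit:
            return False
        self._current += 1
        return True


limiter = SlidingWindowCounter(limit=100, window_seconds=3600)
allowed = limiter.allow()
```

Only two counters are stored per client, which is why this variant uses far less memory than a sliding window log.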

Implementation

  • Storage: Redis (INCR with EXPIRE)
  • Response: 429 Too Many Requests + Retry-After header
  • Distributed: Use centralized Redis, not in-memory (consistent across servers)

Message Queues & Streams

Use Cases

  • Decoupling: Producers/consumers don't need to know about each other
  • Async processing: Handle time-consuming tasks
  • Load leveling: Queue absorbs traffic spikes
  • Reliability: Messages persist until processed

Kafka

  • Model: Distributed log (append-only)
  • Key features:
    • High throughput (millions msgs/sec)
    • Partitions for parallelism
    • Persistent storage
    • Consumer groups
    • Replay capability (seek to offset)
  • Use cases: Event streaming, log aggregation, real-time analytics

RabbitMQ

  • Model: Traditional message broker
  • Key features:
    • Multiple exchange types (direct, topic, fanout)
    • Acknowledgments
    • Priority queues
    • Dead letter queues
  • Use cases: Task queues, RPC, routing

AWS SQS

  • Model: Managed queue service
  • Types:
    • Standard: At-least-once delivery, best-effort ordering
    • FIFO: Exactly-once processing (deduplicates messages within a time window), strict ordering within a message group
  • Features: Auto-scaling, dead letter queues, visibility timeout
  • Use cases: Decoupling microservices, job queues

Backpressure

  • Problem: Slow consumers can't keep up with producers
  • Solutions:
    • Push-back to producers (reject requests)
    • Dynamic batching
    • Increase consumer parallelism
    • Drop messages (if acceptable)

Consumer Groups

  • Concept: Multiple consumers in group share message processing
  • Kafka: Each partition assigned to one consumer in group (maximum useful parallelism = partition count; extra consumers sit idle)
  • Benefits: Horizontal scaling, fault tolerance

Leader Election

Why?

  • Ensure single coordinator in distributed system
  • Prevent split-brain scenarios
  • Coordinate distributed operations

Algorithms

Raft

  • Leader elected via voting
  • Heartbeats maintain leadership
  • Log replication for state machine
  • Pros: Understandable, proven
  • Used in: etcd, Consul

Paxos

  • Consensus via proposers, acceptors, learners
  • Pros: Theoretically sound
  • Cons: Complex to implement
  • Used in: Google Chubby

ZooKeeper (ZAB protocol)

  • Centralized coordination service
  • Sequential consistency
  • Use cases: Configuration management, leader election, distributed locks
  • Drawback: Becomes a critical dependency; run an odd-sized ensemble (e.g., 3 or 5 nodes) so a quorum survives node failures

Failover & Redundancy

Active-Passive (Master-Standby)

  • Setup: One active server, one standby
  • Failover: Standby takes over if active fails
  • Pros: Simpler; only one node serves traffic, so no write conflicts (fencing is still needed to avoid split-brain during failover)
  • Cons: Wasted resources (standby idle)
  • Use case: Databases, critical services

Active-Active (Multi-Master)

  • Setup: All servers handle traffic
  • Pros: Better resource utilization, no failover delay
  • Cons: Complex conflict resolution, data sync
  • Use case: Stateless services, CDNs

Health Checks

  • Types:
    • Passive: Monitor logs/metrics
    • Active: Periodic pings/HTTP checks
  • Considerations: Check interval vs false positives, cascading failures

4. System Design Patterns

Read vs Write-Heavy Systems

Read-Heavy Optimization

  • Caching: Aggressive caching (Redis, CDN)
  • Read replicas: Multiple database replicas
  • Denormalization: Duplicate data to avoid joins
  • Indexing: Optimize for common queries
  • Examples: Social media feeds, news sites, e-commerce browsing

Write-Heavy Optimization

  • Write buffering: Queue writes, batch inserts
  • Asynchronous processing: Background workers
  • Eventual consistency: Accept temporary inconsistency
  • Sharding: Distribute writes across nodes
  • Optimize indexes: Fewer indexes (faster writes, slower reads)
  • Examples: IoT data ingestion, logging systems, analytics

CQRS (Command Query Responsibility Segregation)

Concept

  • Separate models for reads (queries) and writes (commands)
  • Different databases optimized for each

Write Side (Command)

  • Handles business logic
  • Validates and processes commands
  • Emits events

Read Side (Query)

  • Optimized for queries (denormalized views)
  • Updated via events from write side
  • Eventually consistent

Benefits

  • Independent scaling of reads/writes
  • Optimized data models for each
  • Clear separation of concerns

Drawbacks

  • Increased complexity
  • Eventual consistency challenges
  • Need to sync read models

Use Cases

  • Complex domains (e-commerce, banking)
  • Different read/write patterns
  • Multiple read models (different views of data)

Event Sourcing

Concept

  • Store events (state changes) instead of current state
  • Rebuild state by replaying events

Example

Events:
1. AccountCreated(id:123, balance:0)
2. MoneyDeposited(id:123, amount:100)
3. MoneyWithdrawn(id:123, amount:30)

Current state: balance = 70 (derived from events)
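
The replay in this example can be sketched in a few lines. The dataclass definitions and the `apply`/`replay` helpers are illustrative names; the event names and fields mirror the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountCreated:
    id: int
    balance: int


@dataclass(frozen=True)
class MoneyDeposited:
    id: int
    amount: int


@dataclass(frozen=True)
class MoneyWithdrawn:
    id: int
    amount: int


def apply(balance, event):
    """Fold one event into the current balance."""
    if isinstance(event, AccountCreated):
        return event.balance
    if isinstance(event, MoneyDeposited):
        return balance + event.amount
    if isinstance(event, MoneyWithdrawn):
        return balance - event.amount
    raise TypeError(f"unknown event: {event!r}")


def replay(events, upto=None):
    """Rebuild state from the first `upto` events (all events when upto is None)."""
    balance = 0
    for event in events[:upto]:
        balance = apply(balance, event)
    return balance


events = [
    AccountCreated(id=123, balance=0),
    MoneyDeposited(id=123, amount=100),
    MoneyWithdrawn(id=123, amount=30),
]
print(replay(events))           # current state
print(replay(events, upto=2))   # "time travel": state after the first two events
```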

Benefits

  • Complete audit trail
  • Time travel (replay to any point)
  • Debugging (see what happened)
  • Multiple projections from same events

Drawbacks

  • Complex queries (need to replay events)
  • Storage growth
  • Schema evolution challenges

Combined with CQRS

  • Events from write side → update read models
  • Natural fit: the event log is the source of truth on the write side, and read models are projections built from it

Caching Patterns (Revisited)

Write-Through Caching

  • Flow: Write to cache → sync write to DB
  • Pros: Cache always up-to-date
  • Cons: Higher write latency (dual writes)
  • Use case: Read-heavy with some writes

Write-Back (Write-Behind) Caching

  • Flow: Write to cache → async write to DB (batched)
  • Pros: Fast writes, batch optimization
  • Cons: Data loss risk, complexity
  • Use case: High write throughput (logs, analytics)

Write-Around Caching

  • Flow: Write directly to DB, invalidate/bypass cache
  • Pros: Doesn't pollute cache with write data
  • Cons: Next read will be cache miss
  • Use case: Write-once, read-rarely data

Idempotency & Retries

Idempotency

  • Definition: Applying the same operation multiple times has the same effect as applying it once
  • Examples:
    • Idempotent: DELETE /user/123 (same final state each time; the user is gone)
    • Not idempotent: POST /user (creates new user each time)

Why Important?

  • Network failures require retries
  • Distributed systems need duplicate handling
  • Prevent double-charging, duplicate records

Implementation

  • Idempotency keys: Client generates unique ID per request
    POST /payment
    Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
  • Server stores: key → result mapping
  • On retry: Return cached result if key exists
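
A minimal server-side sketch of the key → result mapping is shown below. `PaymentService`, `pay`, and the response fields are hypothetical names; a real service would persist the mapping in a durable store, expire old keys, and handle concurrent requests that carry the same key.

```python
class PaymentService:
    """Stores the response for each idempotency key so retries do not charge twice."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored response
        self.charges = []    # side effects actually performed

    def pay(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # Retry: return the stored result without repeating the side effect.
            return self._results[idempotency_key]
        self.charges.append(amount)
        response = {"status": "charged", "amount": amount, "charge_id": len(self.charges)}
        self._results[idempotency_key] = response
        return response


service = PaymentService()
key = "550e8400-e29b-41d4-a716-446655440000"
first = service.pay(key, 100)
retry_result = service.pay(key, 100)   # e.g. the client timed out and retried
```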

Retry Strategies

  • Exponential backoff: Wait 1s, 2s, 4s, 8s... (with jitter)
  • Circuit breaker: Stop retrying after threshold (prevent cascading failures)
  • Deadline propagation: Don't retry if deadline exceeded

Consistency Models

Strong Consistency (Linearizability)

  • Reads always return latest write
  • Pros: Simple reasoning, no surprises
  • Cons: Higher latency, lower availability
  • Examples: Traditional RDBMS, ZooKeeper
  • Use case: Financial transactions, inventory

Eventual Consistency

  • Reads may return stale data temporarily
  • Eventually all replicas converge
  • Pros: High availability, low latency
  • Cons: Complex application logic
  • Examples: DynamoDB, Cassandra, DNS
  • Use case: Social media, analytics

Causal Consistency

  • Causally-related operations seen in order
  • Concurrent operations can be seen in any order
  • Example:
    • Post message → Like message (causal, must be ordered)
    • Two likes from different users (concurrent, any order OK)
  • Use case: Collaborative applications

Read-Your-Writes Consistency

  • User always sees their own updates
  • Others may have delay
  • Implementation: Route user's reads to same replica

Monotonic Reads

  • If user sees value, subsequent reads won't see older value
  • Prevents: Reading from lagging replica after reading from up-to-date one

5. Advanced Caching

CDN Deep Dive

  • Push CDN: Origin server pushes content to edge servers proactively

    • Pros: Lower latency, predictable
    • Cons: Wasted bandwidth for unpopular content
    • Use case: Popular content known in advance
  • Pull CDN: Edge servers pull content on-demand (cache-aside)

    • Pros: Only cache requested content
    • Cons: First request slow (cold cache)
    • Use case: Long-tail content distribution

Edge Computing

  • Run compute at edge locations (not just caching)
  • Use cases:
    • A/B testing at edge
    • Authentication/authorization
    • Request manipulation
    • Serverless functions
  • Examples: CloudFlare Workers, Lambda@Edge

Redis Advanced Patterns

Redis as Message Broker

  • Pub/Sub for real-time messaging
  • Streams for event sourcing
  • Pros: Fast, simple
  • Cons: No message persistence (pub/sub), less feature-rich than Kafka

Redis as Database

  • Persistence options: RDB snapshots, AOF logs
  • Use cases: Session store, leaderboards, rate limiting

Redis Data Structures

  • Strings, Lists, Sets, Sorted Sets, Hashes
  • HyperLogLog (cardinality estimation)
  • Bitmaps (user activity tracking)
  • Geospatial indexes

Application-Level Caching

In-Memory Caching

  • Libraries: Caffeine (Java), Go-cache, lru-cache (Node.js)
  • Pros: Ultra-fast (no network)
  • Cons: Not shared across servers, memory limited

Distributed Caching

  • Examples: Redis, Memcached
  • Pros: Shared state, larger capacity
  • Cons: Network latency, failure modes

Multi-Level Caching

Request → L1 (in-memory) → L2 (Redis) → L3 (Database)
  • Benefits: Balance speed and size
  • Invalidation: Coordinate across levels

Cache Invalidation Strategies

Time-based (TTL)

  • Simplest, works well for slowly-changing data
  • Risk: Serving stale data until expiration

Event-based

  • Invalidate on data changes
  • Methods: Pub/Sub, CDC, explicit invalidation
  • Pros: Always fresh data
  • Cons: Complexity, potential race conditions

Write-through/Write-behind

  • Update cache on writes
  • Pros: Cache always current
  • Cons: Write overhead

6. Observability

Backoff, Jitter, and Retry Strategies

Exponential Backoff

delay = base_delay * (2 ^ attempt)
Example: 1s, 2s, 4s, 8s, 16s...
  • Problem: Thundering herd (all clients retry simultaneously after backoff)

Jitter

delay = base_delay * (2 ^ attempt) * random(0.5, 1.5)
  • Benefit: Spreads out retries, prevents synchronized thundering herd
  • Types:
    • Full jitter: Random between 0 and max delay
    • Equal jitter: Half fixed, half random
    • Decorrelated jitter: Each delay is drawn at random from a range based on the previous delay, up to a cap
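
The backoff formula and two of the jitter variants can be sketched as below. The `max_delay` cap, the function names, and the injectable `sleep` and `rng` parameters are assumptions added for the example; the retry loop uses full jitter.

```python
import random
import time


def backoff_delay(attempt, base_delay=1.0, max_delay=60.0):
    """Exponential backoff: base_delay * 2 ** attempt, capped at max_delay."""
    return min(max_delay, base_delay * (2 ** attempt))


def full_jitter(attempt, base_delay=1.0, max_delay=60.0, rng=random):
    """Random delay between 0 and the exponential delay."""
    return rng.uniform(0.0, backoff_delay(attempt, base_delay, max_delay))


def equal_jitter(attempt, base_delay=1.0, max_delay=60.0, rng=random):
    """Half of the exponential delay is fixed, the other half is random."""
    delay = backoff_delay(attempt, base_delay, max_delay)
    return delay / 2 + rng.uniform(0.0, delay / 2)


def retry(operation, max_attempts=5, base_delay=1.0, max_delay=60.0,
          sleep=time.sleep, rng=random):
    """Call operation until it succeeds, sleeping with full jitter between attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                    # out of attempts: surface the last error
            sleep(full_jitter(attempt, base_delay, max_delay, rng))
```

In practice the `except` clause would be narrowed to errors known to be transient, since retrying a non-idempotent or permanently failing call only adds load.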

Circuit Breaker Pattern

States:

  1. Closed: Normal operation, requests pass through
  2. Open: Errors exceed threshold, block requests immediately (fail fast)
  3. Half-Open: After timeout, allow test request
    • Success → Close circuit
    • Failure → Reopen circuit

Benefits:

  • Prevent cascading failures
  • Give downstream service time to recover
  • Fast failure instead of waiting for timeout
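
The three states can be modelled with a small state machine. This is a sketch under stated assumptions: `CircuitBreaker`, `CircuitOpenError`, the consecutive-failure threshold, and the `reset_timeout` parameter are illustrative choices, and real implementations often use error rates over a rolling window instead.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0        # consecutive failures seen so far
        self._opened_at = 0.0
        self.state = "closed"

    def call(self, operation):
        if self.state == "open":
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            self.state = "half_open"      # timeout elapsed: allow one test request
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self.state == "half_open" or self._failures >= self._failure_threshold:
                self.state = "open"       # trip (or re-trip) the breaker
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self.state = "closed"             # success closes the circuit
        return result
```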

Logging Best Practices

Structured Logging

{
  "timestamp": "2025-10-02T10:30:00Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc-123",
  "user_id": "456",
  "message": "Payment processing failed",
  "error": "insufficient funds"
}
  • Benefits: Machine-parseable, searchable, aggregatable

Log Levels

  • ERROR: Failures requiring immediate attention
  • WARN: Unexpected but handled (retry success, degraded mode)
  • INFO: Important business events (user signup, payment)
  • DEBUG: Detailed diagnostic info (development only)

Correlation IDs

  • Unique ID per request, propagated across services
  • Benefit: Trace request flow through distributed system
  • Implementation: X-Request-ID header

What NOT to Log

  • Passwords, API keys, PII (privacy/security)
  • Excessive debugging in production (cost/noise)
  • Every database query (performance)

Monitoring & Metrics

Key Metrics (RED Method)

  • Rate: Requests per second
  • Errors: Error rate/count
  • Duration: Latency (p50, p95, p99)

USE Method (for Resources)

  • Utilization: % time resource busy (CPU, memory)
  • Saturation: Queue length (requests waiting)
  • Errors: Error count

Golden Signals (Google SRE)

  • Latency: Time to serve request
  • Traffic: Request volume
  • Errors: Failed request rate
  • Saturation: System fullness (how close to capacity)

Metric Types

  • Counter: Monotonically increasing (requests_total)
  • Gauge: Current value (cpu_usage, active_connections)
  • Histogram: Distribution of values (request_duration_seconds)
  • Summary: Like histogram, but quantiles are calculated on the client side (so they cannot be aggregated across instances)

Prometheus & Grafana

  • Prometheus: Time-series database, pull-based scraping
    • PromQL query language
    • Alertmanager integration
  • Grafana: Visualization dashboards
    • Multiple data sources
    • Alerting, annotations

Alerting Best Practices

Alert Fatigue Prevention

  • Alert on symptoms, not causes (user-facing issues, not low-level)
  • Make alerts actionable (clear remediation steps)
  • Avoid duplicate alerts
  • Use severity levels (critical vs warning)

On-Call Considerations

  • Runbooks: Step-by-step troubleshooting guides
  • Escalation policies: Who to notify, when
  • Postmortems: Blameless analysis after incidents

Distributed Tracing

Concept

  • Track request flow across microservices
  • Visualize latency bottlenecks
  • Identify failing service in chain

Implementation

  • Trace: Single request journey
  • Span: Individual operation (DB query, HTTP call)
  • Trace ID: Unique identifier for entire trace
  • Span ID: Unique identifier for operation

Tools

  • Jaeger: Open-source, CNCF project
  • Zipkin: Twitter-originated
  • OpenTelemetry: Vendor-neutral standard (merges OpenTracing + OpenCensus)

Example Trace

Frontend (50ms)
├─ Auth Service (10ms)
├─ Product Service (30ms)
│ └─ Database Query (25ms) ← Bottleneck!
└─ Payment Service (5ms)

SLO, SLI, SLA

SLI (Service Level Indicator)

  • Definition: Metric representing system health
  • Examples:
    • Request success rate
    • Request latency (e.g., p95 latency in ms; "p95 < 200ms" would be the SLO target)
    • System uptime

SLO (Service Level Objective)

  • Definition: Target value/range for SLI
  • Example: "99.9% of requests succeed"
  • Purpose: Internal goal for reliability

SLA (Service Level Agreement)

  • Definition: Contract with users (with consequences if the agreed level is missed)
  • Example: "99.9% uptime or customer gets refund"
  • Relationship: The SLA target should be no stricter than the SLO (keep the internal SLO tighter to leave a buffer)

Error Budgets

  • Concept: Acceptable downtime based on SLO
  • Example: 99.9% SLO = about 43 minutes downtime allowed per 30-day month
  • Usage: If budget exhausted, freeze features and focus on reliability
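
The downtime figure follows directly from the SLO percentage. The helper below is a small illustrative calculation (the function name and the 30-day default period are assumptions).

```python
def error_budget_minutes(slo_percent, period_days=30):
    """Minutes of allowed downtime in the period for a given availability SLO."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% SLO -> {error_budget_minutes(slo):.1f} minutes per 30 days")
```

For 99.9% this comes to roughly 43.2 minutes per 30 days, and each additional nine shrinks the budget by a factor of ten.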

7. Security & Privacy

Authentication vs Authorization

Authentication

  • Definition: Verifying who the user is
  • Methods:
    • Username/password
    • Multi-factor authentication (MFA)
    • Biometrics
    • Certificate-based

Authorization

  • Definition: What the user can do
  • Models:
    • RBAC (Role-Based): User has roles (admin, editor), roles have permissions
    • ABAC (Attribute-Based): Policies based on attributes (department, clearance level)
    • ACL (Access Control List): Resource lists who can access

Authentication Mechanisms

JWT (JSON Web Tokens)

Structure: Header.Payload.Signature

{
  "sub": "user123",
  "name": "John Doe",
  "exp": 1730000000
}
  • Pros: Stateless, self-contained, scalable
  • Cons: Hard to revoke before expiry (needs a denylist or short-lived tokens), token size
  • Use case: API authentication, microservices
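
To show how the Header.Payload.Signature structure fits together, here is a standard-library sketch of HS256 signing and verification. The `sign_jwt` and `verify_jwt` helpers and the demo secret are illustrative assumptions; production code would normally use a maintained JWT library and validate further claims such as issuer and audience.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url without padding, as used for JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(payload: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_jwt(token: str, secret: bytes, now: int) -> dict:
    """Return the payload if the signature is valid and the token has not expired."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # Constant-time comparison; the algorithm is fixed to HS256 regardless of the header.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    if payload.get("exp", 0) <= now:
        raise ValueError("token expired")
    return payload


secret = b"demo-secret"   # placeholder; real secrets come from a KMS or secret store
token = sign_jwt({"sub": "user123", "name": "John Doe", "exp": 1730000000}, secret)
claims = verify_jwt(token, secret, now=1729999999)
```

The server keeps no per-token state here: anyone holding the secret can verify the token, which is what makes JWTs stateless and also why they are hard to revoke early.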

OAuth 2.0

  • Purpose: Delegated authorization (allow app access without sharing password)
  • Flow Example (Authorization Code):
    1. User clicks "Login with Google"
    2. Redirect to Google (authorization server)
    3. User approves
    4. Google redirects back with authorization code
    5. App exchanges code for access token
    6. App uses token to access Google APIs
  • Roles:
    • Resource Owner (user)
    • Client (your app)
    • Authorization Server (Google)
    • Resource Server (Google APIs)

SSO (Single Sign-On)

  • Definition: One login for multiple applications
  • Protocols: SAML, OpenID Connect (an identity layer built on OAuth 2.0)
  • Benefits: Better UX, centralized access control
  • Use case: Enterprise applications

Session-Based Authentication

  • Flow:
    1. User logs in
    2. Server creates session, stores in DB/Redis
    3. Returns session ID cookie
    4. Client sends cookie with requests
  • Pros: Can revoke immediately
  • Cons: Stateful, harder to scale (requires sticky sessions or shared session store)

Encryption

TLS/HTTPS

  • Purpose: Encrypt data in transit
  • Handshake:
    1. Client Hello (supported cipher suites)
    2. Server Hello (chosen cipher, certificate)
    3. Client verifies certificate
    4. Key exchange (establish shared secret)
    5. Encrypted communication
  • TLS 1.3: Faster handshake, stronger security

Data at Rest

  • Encryption: AES-256 (symmetric encryption)
  • Key management: HSM (Hardware Security Module), KMS (Key Management Service)
  • Database encryption:
    • Full disk encryption
    • Column-level encryption (for sensitive fields)
  • Application-level encryption: Encrypt before storing

Hashing vs Encryption

  • Hashing: One-way (passwords)
    • Use bcrypt, Argon2 (not MD5, SHA1)
    • Salt to prevent rainbow tables
  • Encryption: Two-way (reversible with key)
    • Use AES, RSA
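The salted one-way hashing idea can be sketched with the standard library. Note the swap: bcrypt and Argon2 need third-party packages, so this example uses PBKDF2-HMAC-SHA256 from `hashlib` instead. The iteration count is an assumed work factor, and the function names are illustrative.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

DEFAULT_ITERATIONS = 600_000  # assumed work factor; tune for your hardware


def hash_password(password: str, salt: Optional[bytes] = None,
                  iterations: int = DEFAULT_ITERATIONS) -> Tuple[bytes, bytes]:
    """Return (salt, derived_key) using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # unique random salt per password defeats rainbow tables
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, key


def verify_password(password: str, salt: bytes, expected_key: bytes,
                    iterations: int = DEFAULT_ITERATIONS) -> bool:
    _, key = hash_password(password, salt, iterations)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(key, expected_key)
```

Store the salt and iteration count alongside the derived key; neither is secret, and both are needed to verify a login later.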

DDoS Protection & WAF

DDoS (Distributed Denial of Service)

Types:

  • Volumetric: Flood with traffic (UDP flood, amplification attacks)
  • Protocol: Exploit protocol weaknesses (SYN flood)
  • Application Layer: Target application (HTTP flood)

Mitigation:

  • Rate limiting: Per IP, per endpoint
  • CDN: Absorb traffic at edge
  • Anycast: Distribute traffic across locations
  • Traffic analysis: Identify and block malicious patterns
  • Overprovisioning: Have excess capacity

WAF (Web Application Firewall)

  • Purpose: Filter malicious HTTP traffic
  • Protection against:
    • SQL injection
    • XSS (Cross-Site Scripting)
    • CSRF (Cross-Site Request Forgery)
    • Path traversal
  • Types:
    • Network-based (hardware appliance)
    • Host-based (integrated in app)
    • Cloud-based (CloudFlare, AWS WAF)
  • Rules: Signature-based, behavioral analysis

Data Privacy (GDPR Basics)

Key Principles

  • Lawful basis: Processing needs a legal ground (e.g., consent, contract, legal obligation, legitimate interest)
  • Data minimization: Collect only necessary data
  • Purpose limitation: Use data only for stated purpose
  • Storage limitation: Don't keep data longer than needed
  • Accuracy: Keep data up-to-date
  • Security: Protect with encryption, access controls

User Rights

  • Right to access: User can request their data
  • Right to erasure: "Right to be forgotten" (delete data)
  • Right to portability: Export data in machine-readable format
  • Right to rectification: Correct inaccurate data

Implementation Considerations

  • Data inventory: Know what PII you collect
  • Consent management: Track and honor user consent
  • Data retention policies: Auto-delete old data
  • Breach notification: Report breaches to the supervisory authority within 72 hours of becoming aware
  • Privacy by design: Build privacy into systems from start

8. Infrastructure & Deployment

Containers & Orchestration

Docker

Key Concepts:

  • Image: Read-only template (base OS + app + dependencies)
  • Container: Running instance of image
  • Dockerfile: Instructions to build image
  • Layers: Each instruction creates layer (cached for efficiency)

Benefits:

  • Consistent environments (dev = prod)
  • Lightweight (vs VMs)
  • Fast startup
  • Isolated processes

Kubernetes (K8s)

Architecture:

  • Control Plane: Master node(s)
    • API Server
    • Scheduler (assigns pods to nodes)
    • Controller Manager (maintains desired state)
    • etcd (distributed config store)
  • Worker Nodes: Run pods
    • kubelet (agent)
    • kube-proxy (networking)
    • Container runtime (containerd, CRI-O)

Key Resources:

  • Pod: Smallest unit (1+ containers)
  • Deployment: Manages replica sets, rolling updates
  • Service: Stable endpoint for pods (load balancing)
  • ConfigMap: Configuration data
  • Secret: Sensitive data (base64-encoded by default; encrypted only if encryption at rest is enabled)
  • Ingress: HTTP(S) routing to services
  • Namespace: Virtual clusters for isolation

Benefits:

  • Auto-scaling (HPA - Horizontal Pod Autoscaler)
  • Self-healing (restart failed pods)
  • Rolling updates, rollbacks
  • Service discovery
  • Storage orchestration

CI/CD Pipelines

Continuous Integration (CI)

  • Goal: Frequently merge code to main branch
  • Pipeline:
    1. Code commit triggers build
    2. Run tests (unit, integration)
    3. Static analysis (linting, security scans)
    4. Build artifacts (Docker images)
  • Benefits: Catch bugs early, reduce merge conflicts

Continuous Deployment (CD)

  • Goal: Automatically deploy to production
  • Pipeline:
    1. Successful CI build
    2. Deploy to staging
    3. Run E2E tests
    4. Deploy to production (if tests pass)

Deployment Strategies

Blue-Green Deployment

  • Setup: Two identical environments (Blue = current, Green = new)
  • Process:
    1. Deploy new version to Green
    2. Test Green
    3. Switch traffic from Blue to Green
    4. Keep Blue for quick rollback
  • Pros: Zero downtime, instant rollback
  • Cons: Double resources

Canary Deployment

  • Process:
    1. Deploy new version to small % of servers (5%)
    2. Monitor errors, performance
    3. Gradually increase % (10%, 25%, 50%, 100%)
    4. Rollback if issues
  • Pros: Lower risk, real-world testing
  • Cons: Slower rollout, complex routing

Rolling Deployment

  • Process: Update servers one-by-one (or in batches)
  • Pros: No extra resources
  • Cons: Mixed versions during rollout, slower rollback

Feature Flags

  • Deploy code with features disabled
  • Enable features gradually (per user, %)
  • Pros: Decouple deployment from release, A/B testing
  • Cons: Code complexity, technical debt
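Percentage rollouts (for feature flags or canaries) are often implemented by hashing a stable identifier into a bucket, so the same user always gets the same answer. A minimal sketch, assuming string user IDs and a 0-99 bucket range (`rollout_bucket` and `is_enabled` are hypothetical names):

```python
import hashlib


def rollout_bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, percent: int) -> bool:
    """Enable the flag for roughly `percent` percent of users."""
    return rollout_bucket(flag, user_id) < percent
```

Including the flag name in the hash keeps different flags from always selecting the same users, and raising `percent` only ever adds users to the enabled set.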

Service Discovery

Problem

  • Microservices need to find each other
  • IPs/ports change dynamically (scaling, failures)

Client-Side Discovery

  • Process: Client queries service registry, chooses instance, makes request
  • Examples: Netflix Eureka
  • Pros: Client controls load balancing
  • Cons: Client complexity, tight coupling to registry

Server-Side Discovery

  • Process: Client requests load balancer, load balancer queries registry
  • Examples: AWS ELB, Kubernetes Service
  • Pros: Client simplicity
  • Cons: Load balancer is potential bottleneck/SPOF

Service Registry

  • Examples: Consul, etcd, ZooKeeper
  • Features:
    • Health checks
    • Automatic deregistration of failed services
    • DNS interface
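The registry features above (health checks and automatic deregistration) can be approximated in memory with heartbeats and a TTL. This is a toy sketch for intuition, not how Consul, etcd, or ZooKeeper are implemented; the class, its methods, and the round-robin `pick` are all assumptions of the example.

```python
import time


class ServiceRegistry:
    """Toy registry: instances stay registered only while they keep heartbeating."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_seen = {}   # (service, address) -> last heartbeat time
        self._next_index = {}  # service -> round-robin counter

    def heartbeat(self, service, address):
        """Register an instance or refresh its lease."""
        self._last_seen[(service, address)] = self._clock()

    def healthy_instances(self, service):
        now = self._clock()
        # Automatic deregistration: drop instances whose lease has expired.
        expired = [k for k, seen in self._last_seen.items() if now - seen > self._ttl]
        for key in expired:
            del self._last_seen[key]
        return sorted(addr for (svc, addr) in self._last_seen if svc == service)

    def pick(self, service):
        """Client-side discovery: choose an instance round-robin."""
        instances = self.healthy_instances(service)
        if not instances:
            raise LookupError(f"no healthy instances for {service}")
        index = self._next_index.get(service, 0)
        self._next_index[service] = index + 1
        return instances[index % len(instances)]
```

Injecting the clock makes expiry testable without sleeping; a real registry would also replicate this state across nodes.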

Kubernetes Service Discovery

  • Built-in via DNS and environment variables
  • ClusterIP service provides stable virtual IP
  • DNS: service-name.namespace.svc.cluster.local

API Gateway

Purpose

  • Single entry point for clients
  • Abstracts backend complexity

Responsibilities

  • Routing: Direct requests to appropriate microservice
  • Authentication/Authorization: Centralized security
  • Rate limiting: Protect backend services
  • Request/Response transformation: Adapt protocols/formats
  • Caching: Reduce backend load
  • Logging & Monitoring: Centralized observability
  • SSL termination: Handle TLS at gateway

Patterns

  • Backend for Frontend (BFF): Separate gateway per client type (web, mobile, IoT)

Examples

  • Kong, AWS API Gateway, Apigee, Zuul

Microservices vs Monoliths

Monolith

Pros:

  • Simple to develop, test, deploy (initially)
  • No network latency between components
  • Easier transactions (single DB)
  • Simpler debugging

Cons:

  • Scales as one unit (can't scale component independently)
  • Tech stack lock-in
  • Deployment risk (entire app redeployed)
  • Codebase becomes unwieldy

Microservices

Pros:

  • Independent scaling per service
  • Technology diversity
  • Fault isolation (one service failure doesn't crash all)
  • Faster deployments (small, independent)
  • Team autonomy

Cons:

  • Distributed system complexity (network failures, latency)
  • Data consistency challenges
  • Increased operational overhead (monitoring, deployment)
  • Testing complexity

When to Use

  • Monolith: Startups, simple domains, small teams
  • Microservices: Large orgs, complex domains, independent team scaling

Migration Strategy

  • Start with monolith
  • Extract services as domain understanding grows
  • "Strangler Fig" pattern (gradually replace monolith pieces)

9. Special Topics

Search Systems

Inverted Index (Deep Dive)

Structure:

Term → [Doc1, Doc2, ...]

"hello" → [doc1, doc3, doc5]
"world" → [doc1, doc2]

With Positions (for phrase queries):

"hello" → {doc1: [0, 15], doc3: [5]}

Search Process:

  1. Tokenize query: "hello world" → ["hello", "world"]
  2. Lookup each term in index
  3. Intersect posting lists: doc1 (appears in both)
  4. Rank results by relevance
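The first three steps of the search process can be sketched in a few lines. This assumes whitespace tokenization and lowercasing, which is far simpler than a real analyzer, and it omits ranking; the function names are illustrative.

```python
from collections import defaultdict


def tokenize(text):
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Intersect starting from the shortest posting list to keep work small.
    postings = sorted((index.get(term, set()) for term in terms), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


if __name__ == "__main__":
    docs = {"doc1": "hello world", "doc2": "brave new world",
            "doc3": "hello there", "doc5": "say hello"}
    print(search(build_index(docs), "hello world"))  # {'doc1'}
```

Production engines store posting lists sorted and compressed on disk and intersect them with merge or skip-list techniques rather than in-memory sets.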

Ranking Algorithms

TF-IDF (Term Frequency-Inverse Document Frequency)

  • TF: How often term appears in document
  • IDF: How rare term is across all documents
  • Score: TF × IDF (terms that are frequent in a document but rare across the corpus score high)
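A small worked version of the score, assuming one common variant: TF as relative frequency within the document and IDF as log(N / document frequency). Other variants add smoothing terms, so treat the exact formula as an assumption of this sketch.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Score one term for one document; corpus is a list of token lists."""
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    if doc_freq == 0:
        return 0.0
    idf = math.log(len(corpus) / doc_freq)
    return tf * idf
```

With this variant a term that appears in every document gets IDF = log(1) = 0, so words like "the" contribute nothing to the score no matter how often they occur.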

BM25 (Best Match 25)

  • Improved TF-IDF with diminishing returns
  • Considers document length normalization
  • Industry standard (default ranking function in Lucene and Elasticsearch)

Elasticsearch Architecture

  • Cluster: Multiple nodes
  • Index: Collection of documents (like database)
  • Shard: Subset of index data (for horizontal scaling)
  • Replica: Copy of shard (for availability)

Query Types:

  • Match query (full-text search)
  • Term query (exact match)
  • Range query (dates, numbers)
  • Bool query (AND, OR, NOT)
  • Fuzzy query (typo tolerance)

Bloom Filters

Problem

  • Check if element exists in set
  • Traditional: Hash table (space inefficient for large sets)

Bloom Filter

  • Data structure: Bit array + k hash functions
  • Add: Set bits at k hash positions to 1
  • Check: If all k bits are 1, element might exist
  • False positives: Possible (bits set by other elements)
  • False negatives: Impossible (if any of the k bits is 0, the element was definitely never added)

Use Cases

  • Database: Check if key exists before expensive disk lookup
  • Web: Block malicious URLs (quick check before full validation)
  • Distributed systems: Reduce unnecessary network calls
  • Example: Google Chrome historically used bloom filters for malicious site detection (Safe Browsing)

Trade-off

  • Space efficient (small bit array)
  • Tunable false positive rate (more bits/hashes = fewer false positives)
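A compact sketch of the structure described above: a bit array plus k hash positions per item. It derives the k positions from a single SHA-256 digest using double hashing, which is one common approach and an assumption of this example; the default sizes are arbitrary.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: position_i = (h1 + i * h2) mod m.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so strides vary
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

Bits are never cleared, which is why a plain bloom filter cannot support deletion and why an added item can never be reported absent.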

Recommendation Systems

Collaborative Filtering

User-based:

  • Find similar users (based on past behavior)
  • Recommend items those users liked
  • Example: Users who liked A and B also liked C

Item-based:

  • Find similar items (based on user interactions)
  • Recommend similar items to what user liked
  • Example: People who liked this movie also liked...

Matrix Factorization (popularized by the Netflix Prize):

  • Decompose user-item matrix into latent factors
  • Predict missing ratings

Content-Based Filtering

  • Recommend based on item attributes
  • Example: User likes sci-fi movies → recommend other sci-fi

Hybrid Approaches

  • Combine collaborative + content-based
  • Cold start problem: Use content-based for new users/items

Ranking

  • Factors: Relevance, popularity, diversity, freshness
  • ML models: Gradient boosting, neural networks
  • A/B testing: Compare ranking algorithms

Distributed Transactions

Problem

  • Transaction spans multiple databases/services
  • Need ACID guarantees across systems

Two-Phase Commit (2PC)

Phase 1 (Prepare):

  1. Coordinator asks all participants: "Can you commit?"
  2. Participants lock resources, respond yes/no

Phase 2 (Commit/Abort):

  1. If all said yes: Coordinator sends "commit" to all
  2. If any said no: Coordinator sends "abort" to all

Problems:

  • Blocking: If coordinator crashes, participants locked
  • Single point of failure: Coordinator
  • Performance: Synchronous, slow
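The two phases can be sketched with in-process objects standing in for remote participants. This deliberately leaves out the hard parts named above (timeouts, coordinator crashes, durable logs); the `Participant` class and `two_phase_commit` function are hypothetical.

```python
class Participant:
    """Stand-in for a remote resource manager."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: lock resources and vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (commit/abort): commit only if the vote was unanimous.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

Between a "yes" vote and the phase-2 message, a participant is stuck in `prepared` holding its locks, which is exactly where the blocking problem arises if the coordinator disappears.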

Saga Pattern

Concept: Break transaction into local transactions, compensate on failure

Example (booking trip):

  1. Book flight (local tx)
  2. Book hotel (local tx)
  3. Book car (local tx)

If step 3 fails → compensate: cancel hotel, cancel flight

Types:

  • Choreography: Services communicate via events (decentralized)
  • Orchestration: Central coordinator (like 2PC but async)

Pros: No blocking, better availability

Cons: Eventual consistency, complex compensation logic
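An orchestrated saga can be sketched as a list of (action, compensation) pairs: run the actions in order and, on failure, undo the completed ones in reverse. This in-process version is only an illustration; a real orchestrator persists progress and retries compensations. `run_saga` is a hypothetical name.

```python
def run_saga(steps) -> bool:
    """steps: list of (name, action, compensation) with zero-argument callables.

    Returns True if every action succeeded, False if the saga was rolled back.
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            # Undo already-committed local transactions, most recent first.
            for _done_name, undo in reversed(completed):
                undo()
            return False
        completed.append((name, compensation))
    return True
```

For the trip example, a failure while booking the car would trigger "cancel hotel" and then "cancel flight"; the failed step itself needs no compensation because its local transaction never committed.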

When to Use

  • 2PC: When strong consistency absolutely required (rare)
  • Saga: Most distributed systems (accept eventual consistency)
  • Avoid distributed transactions: Design to avoid need (bounded contexts)

Consensus Algorithms

Why Needed?

  • Distributed systems need to agree on values
  • Leader election, configuration, distributed locks

Raft (Understandable Consensus)

Roles:

  • Leader: Handles all client requests
  • Follower: Passive, replicate leader's log
  • Candidate: Follower becomes candidate during election

Leader Election:

  1. Leader sends heartbeats
  2. If follower doesn't hear heartbeat (timeout) → becomes candidate
  3. Candidate requests votes from other nodes
  4. Majority votes → becomes leader

Log Replication:

  1. Leader receives command, appends to log
  2. Sends log entry to followers
  3. When majority replicate → entry committed
  4. Leader notifies followers, applies to state machine

Guarantees:

  • Only one leader per term
  • Logs eventually identical across servers
  • Committed entries durable

Paxos

  • More complex than Raft (harder to understand/implement)
  • Three roles: Proposers, Acceptors, Learners
  • Multi-Paxos optimized for multiple decisions
  • Used in: Google Chubby, Spanner

Practical Usage

  • Don't implement yourself: Use existing (etcd, Consul, ZooKeeper)
  • Use for: Leader election, distributed config, locking
  • Not for: Every coordination need (too heavyweight)

Time & Ordering in Distributed Systems

Problem

  • No global clock in distributed systems
  • Clock skew (servers have different times)
  • Need to order events across servers

Lamport Clocks (Logical Time)

  • Each process maintains counter
  • Rules:
    1. Increment counter before each event
    2. Send counter with message
    3. Receiver sets counter = max(local, received) + 1
  • Property: If event A happened-before B, then timestamp(A) < timestamp(B)
  • Limitation: Converse not true (can't determine causality from timestamps alone)
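The three rules translate almost line for line into code. A minimal sketch (the class and method names are illustrative):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Rule 1: increment before each local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Rule 2: sending is an event; attach the resulting timestamp."""
        return self.tick()

    def receive(self, received: int) -> int:
        """Rule 3: jump past both the local and the received timestamp."""
        self.time = max(self.time, received) + 1
        return self.time
```

Because a receive always lands strictly after the send timestamp, any chain of messages yields increasing timestamps, which gives the happened-before property stated above.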

Vector Clocks

  • Each process maintains vector of counters (one per process)
  • Example: [P1:3, P2:5, P3:2]
  • Rules:
    1. Increment own counter on event
    2. Send entire vector with message
    3. Receiver merges vectors (max of each component), then increments own counter
  • Property: Can determine causality
    • A happened-before B: VA ≤ VB in every component and VA ≠ VB
    • Concurrent events: Neither VA < VB nor VB < VA
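A sketch of vector clocks as plain dictionaries, with missing entries treated as zero. The comparison uses the usual definition (every component less than or equal, and at least one strictly less); the function names are assumptions of this example.

```python
def increment(clock: dict, process: str) -> dict:
    """Local event: bump this process's own counter."""
    updated = dict(clock)
    updated[process] = updated.get(process, 0) + 1
    return updated


def merge(local: dict, received: dict, process: str) -> dict:
    """Receive: component-wise max, then count the receive as a local event."""
    merged = {
        key: max(local.get(key, 0), received.get(key, 0))
        for key in local.keys() | received.keys()
    }
    return increment(merged, process)


def happened_before(a: dict, b: dict) -> bool:
    keys = a.keys() | b.keys()
    no_greater = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    some_less = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return no_greater and some_less


def concurrent(a: dict, b: dict) -> bool:
    differ = any(a.get(k, 0) != b.get(k, 0) for k in a.keys() | b.keys())
    return differ and not happened_before(a, b) and not happened_before(b, a)
```

Two writes whose clocks are `concurrent` are the case a store like Dynamo or Riak surfaces as a conflict for the application to resolve.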

Use Cases

  • Lamport clocks: Total ordering of events (distributed snapshots)
  • Vector clocks: Conflict detection (Riak, Dynamo)
    • Example: Detect concurrent updates to same key

TrueTime (Google Spanner)

  • Uses atomic clocks + GPS for global time
  • Time is interval (t ± ε) accounting for uncertainty
  • Wait out uncertainty before committing (so commit timestamps respect real-time order)

10. Additional Important Topics

Back-of-the-Envelope Calculations

Common Numbers (Latency)

  • L1 cache: 0.5 ns
  • L2 cache: 7 ns
  • RAM: 100 ns
  • SSD read: 150 μs
  • HDD seek: 10 ms
  • Network within datacenter: 0.5 ms
  • Round trip CA to Netherlands: 150 ms

Storage Capacity

  • 1 KB = 1,000 bytes
  • 1 MB = 1,000 KB
  • 1 GB = 1,000 MB
  • 1 TB = 1,000 GB
  • 1 PB = 1,000 TB

Traffic Estimates

  • Example: 100M DAU, average 10 requests/day
    • QPS = 100M × 10 / 86400 ≈ 11,574 req/s
    • Peak QPS ≈ 2-3× average ≈ 30,000 req/s

Storage Estimates

  • Example: 1M tweets/day, 280 chars average, 5 years retention
    • 280 bytes × 1M × 365 × 5 ≈ 511 GB (roughly 0.5 TB), assuming 1 byte per character and no metadata
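Both estimates above are simple enough to script, which helps when varying the inputs during an interview. A minimal sketch using the same numbers (the function names are illustrative, and the storage figure ignores metadata, indexes, and replication):

```python
SECONDS_PER_DAY = 86_400


def average_qps(daily_active_users: int, requests_per_user_per_day: float) -> float:
    return daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY


def storage_bytes(items_per_day: int, bytes_per_item: int, retention_days: int) -> int:
    return items_per_day * bytes_per_item * retention_days


if __name__ == "__main__":
    qps = average_qps(100_000_000, 10)
    print(round(qps))                                      # 11574
    print(storage_bytes(1_000_000, 280, 365 * 5) / 1e9)    # 511.0 (GB)
```

Multiplying the average by a peak factor of 2 to 3 gives the range the text rounds to about 30,000 req/s.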

Polling vs Push vs Long Polling

Polling

  • Client periodically requests updates
  • Pros: Simple, stateless
  • Cons: Wasted requests (if no updates), delayed updates

Push (WebSockets, SSE)

  • Server pushes updates when available
  • Pros: Real-time, efficient
  • Cons: Complex, stateful connections

Long Polling

  • Client requests, server holds until update available (or timeout)
  • Pros: More real-time than polling, better compatibility than WebSockets
  • Cons: Still overhead of reconnections

Database Connection Pooling

  • Problem: Creating DB connections is expensive (TCP handshake, auth)
  • Solution: Pool of reusable connections
  • Benefits: Faster response, controlled max connections
  • Configuration: Min/max pool size, connection timeout, idle timeout
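The pooling idea can be sketched with a thread-safe queue of idle connections and a cap on how many are ever created. This is a simplified illustration that omits idle timeouts and connection health checks; `ConnectionPool` and the `factory` callable are assumptions, standing in for whatever a real driver provides.

```python
import queue
import threading
from contextlib import contextmanager


class ConnectionPool:
    def __init__(self, factory, max_size=5, acquire_timeout=1.0):
        self._factory = factory            # zero-argument callable that opens a connection
        self._max_size = max_size
        self._acquire_timeout = acquire_timeout
        self._idle = queue.Queue()
        self._created = 0
        self._lock = threading.Lock()

    def acquire(self):
        try:
            return self._idle.get_nowait()  # reuse an idle connection if one exists
        except queue.Empty:
            pass
        with self._lock:
            may_create = self._created < self._max_size
            if may_create:
                self._created += 1
        if may_create:
            try:
                return self._factory()
            except Exception:
                with self._lock:
                    self._created -= 1      # creation failed, give the slot back
                raise
        try:
            # At capacity: wait for another caller to release a connection.
            return self._idle.get(timeout=self._acquire_timeout)
        except queue.Empty:
            raise TimeoutError("no connection available") from None

    def release(self, conn) -> None:
        self._idle.put(conn)

    @contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn
        finally:
            self.release(conn)
```

The cap is what gives "controlled max connections": callers beyond the limit wait or time out instead of overwhelming the database.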

Partitioning vs Sharding

  • Often used interchangeably
  • Partitioning: Splitting data (can be on same server)
    • Horizontal: Split rows (same schema)
    • Vertical: Split columns (different tables)
  • Sharding: Horizontal partitioning across multiple servers

Webhooks

  • Concept: Server calls client URL when event occurs
  • Use cases: Payment notifications, GitHub push events
  • Considerations:
    • Retry logic (client might be down)
    • Idempotency (duplicates possible)
    • Security (validate sender, HTTPS)
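Two of the considerations above, validating the sender and tolerating duplicates, can be sketched together. This assumes the sender signs the raw body with a shared secret using HMAC-SHA256 and includes a unique event ID, which is a common convention but varies by provider; the names below are hypothetical.

```python
import hashlib
import hmac


def sign(body: bytes, secret: bytes) -> str:
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


class WebhookReceiver:
    def __init__(self, secret: bytes):
        self._secret = secret
        self._seen_event_ids = set()  # in production this would be a durable store

    def handle(self, event_id: str, body: bytes, signature: str) -> str:
        # Security: reject payloads that were not signed with the shared secret.
        if not hmac.compare_digest(sign(body, self._secret), signature):
            return "rejected"
        # Idempotency: retries deliver the same event ID, so process it once.
        if event_id in self._seen_event_ids:
            return "duplicate"
        self._seen_event_ids.add(event_id)
        return "processed"
```

Returning success for a duplicate (rather than an error) lets the sender stop retrying without the event being applied twice.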

Reverse Hash Lookup (Distributed Hash Table)

  • Use case: P2P systems (BitTorrent, blockchain)
  • Concept: Hash key maps to node responsible for storing it
  • Consistent hashing: Add/remove nodes with minimal reshuffling
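Consistent hashing can be sketched as a sorted ring of hashed node positions, where a key belongs to the first node clockwise from its own hash. The virtual-node count and SHA-256 hashing are assumptions of this example, and `HashRing` is an illustrative name.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes   # virtual nodes per physical node, for smoother balance
        self._ring = []         # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring position at or after the key's hash, wrapping around at the end.
        index = bisect.bisect_right(self._ring, (self._hash(key), ""))
        if index == len(self._ring):
            index = 0
        return self._ring[index][1]
```

Removing a node only reassigns the keys that node owned; every other key keeps its placement, which is the "minimal reshuffling" property.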

Interview Preparation Tips

How to Approach System Design Interviews

1. Clarify Requirements (5 min)

  • Functional: What features? (read/write, search, notifications)
  • Non-functional: Scale (users, requests/sec), latency, availability
  • Constraints: Budget, timeline, existing infrastructure

2. Back-of-Envelope Estimates (5 min)

  • Calculate QPS, storage, bandwidth
  • Determine scale tier (thousands vs millions vs billions)

3. High-Level Design (10-15 min)

  • Draw main components (client, load balancer, servers, databases, cache)
  • API design (key endpoints, request/response)
  • Data model (tables, relationships)

4. Deep Dive (15-20 min)

  • Interviewer will probe specific areas
  • Be ready to discuss: scaling, failures, bottlenecks, trade-offs
  • Common deep dives: Database choice, caching strategy, consistency model

5. Wrap Up (5 min)

  • Monitoring, metrics, alerts
  • Potential improvements, future scaling

Common Mistakes to Avoid

  • Jumping to solution without clarifying requirements
  • Over-engineering (don't add Kafka if simple queue suffices)
  • Ignoring trade-offs (every decision has pros/cons)
  • Not considering failures (what if DB goes down?)
  • Forgetting about monitoring/observability

Key Trade-offs to Discuss

  • Consistency vs Availability (CAP theorem)
  • Latency vs Throughput (batch processing vs real-time)
  • Normalization vs Denormalization (storage vs query speed)
  • SQL vs NoSQL (ACID vs scalability)
  • Monolith vs Microservices (simplicity vs scalability)
  • Synchronous vs Asynchronous (simplicity vs performance)

Practice Questions

  • Design Twitter/Instagram
  • Design URL shortener
  • Design video streaming (YouTube/Netflix)
  • Design messaging system (WhatsApp)
  • Design ride-sharing (Uber)
  • Design newsfeed
  • Design web crawler
  • Design search autocomplete
  • Design rate limiter
  • Design distributed cache
  • Design key-value store
  • Design notification system

Quick Reference

When to Use SQL vs NoSQL

Use SQL when:

  • Need ACID transactions
  • Complex queries with joins
  • Structured, relational data
  • Data integrity critical

Use NoSQL when:

  • Massive scale (horizontal scaling)
  • Flexible schema
  • High write throughput
  • Eventual consistency acceptable

Caching Decision Tree

  1. Frequently accessed data? → Yes → Cache it
  2. Read-heavy or write-heavy?
    • Read-heavy → Cache-aside
    • Write-heavy → Write-through or write-behind
  3. Consistency critical?
    • Yes → Write-through
    • No → Cache-aside with TTL

Database Replication Strategy

  • Read-heavy → Master-slave
  • Write-heavy + global → Master-master
  • Strong consistency → Master-slave with sync replication
  • High availability → Master-master or multi-region

Message Queue vs Database

  • Use Queue when: Async processing, decoupling, load leveling
  • Use Database when: Need to query data, ACID required, persistent storage

This guide covers the essential system design topics for interviews. Remember: there's rarely one "correct" answer in system design. Focus on demonstrating your thought process, understanding trade-offs, and designing for the stated requirements.